A Comparative Study on Representation of Web Pages in Automatic Text Categorization

Authors

  • Seyda Ertekin
  • C. Lee Giles
Abstract

With many web sites appearing every day, it has become increasingly difficult to keep web directories up to date and growing. This rapid growth of the World Wide Web has further stimulated interest in applying machine learning to automatic text categorization. We believe that Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure and by hyperlinks. In this paper, our objective is to analyze different combinations of representations for the training and test documents of an SVM classifier. Our experiments show that representing web sites with their META data and extended inbound anchor text, in addition to their content, enhances classification performance. Moreover, using expected entropy loss values to weight the term frequencies in the feature vector provides a further performance improvement for the SVM classifier.
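The weighting idea in the last sentence of the abstract can be illustrated with a minimal sketch. It assumes the common definition of expected entropy loss for a binary category (prior class entropy minus the expected class entropy after observing whether a term is present) and assumes that weighting means multiplying each term frequency by that value. The abstract does not specify the exact formula or scheme, so the function names `expected_entropy_loss` and `weighted_vector` and the toy data are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def binary_entropy(positives, total):
    """Entropy of a two-class split with `positives` out of `total` documents."""
    if total == 0:
        return 0.0
    p = positives / total
    return entropy([p, 1 - p])


def expected_entropy_loss(docs, labels, term):
    """Prior class entropy minus expected entropy after observing `term`.

    docs   : list of sets of terms (assumed representation of each page)
    labels : list of bools, True if the page belongs to the category
    """
    n = len(docs)
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    prior = binary_entropy(sum(labels), n)
    posterior = (
        len(with_term) / n * binary_entropy(sum(with_term), len(with_term))
        + len(without_term) / n * binary_entropy(sum(without_term), len(without_term))
    )
    return prior - posterior


def weighted_vector(tokens, losses):
    """Scale raw term frequencies by per-term weights (assumed scheme: product)."""
    counts = Counter(tokens)
    return {term: counts[term] * loss for term, loss in losses.items() if term in counts}


# Toy corpus: "svm" separates the two classes, "page" occurs everywhere.
docs = [{"svm", "page"}, {"svm", "page"}, {"goal", "page"}, {"goal", "page"}]
labels = [True, True, False, False]
losses = {t: expected_entropy_loss(docs, labels, t) for t in ("svm", "page")}
vector = weighted_vector(["svm", "svm", "page"], losses)
```

In this toy corpus a term that perfectly separates the classes receives the full prior entropy as its weight, while a term present in every document receives zero, so its frequency no longer influences the feature vector passed to the classifier.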


Similar resources

Automated multi-label text categorization with VG-RAM weightless neural networks

In automated multi-label text categorization, an automatic categorization system should output a label set, whose size is unknown a priori, for each document under analysis. Many machine learning techniques have been used for building such automatic text categorization systems. In this paper, we examine virtual generalizing random access memory weightless neural networks (VG-RAM WNN), an effect...


Using neighborhood information for automated categorization of Web pages

In this paper we discuss several issues related to the influence of expanding a Web document representation on the quality of topical categorization of Web pages. We consider expanding a Web page with the text content of its linking pages. We show that naive expansion can pick up too much noise and substantially harm categorization results. We present the approach to automated pruning of linking Web...


Arabic text categorization: a comparative study of different representation modes

The quantity of accessible information on the Internet is phenomenal, and its categorization remains one of the most important problems. A lot of work is currently focused on English, rightly so, since it is the dominant language of the Web. However, a need arises for other languages, because the Web is becoming more multilingual each day. The need is much more pressing for the Arabic language. Our research...


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Web Classification Approach Using Reduced Vector Representation Model Based on Html Tags

Automatic web page classification plays an essential role in information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags, hyperlinks, etc.) that make their classification different from standard text categorization. Thus, when applied to web data, traditional text classifiers do not usually produce promising results. In this pap...




Publication date: 2006